Encoder in the Transformer


Gaurav
February 12, 2024

Encoder in the Transformer:

Encoder Structure

Each encoder in the Transformer consists of two main parts:

  1. Multi-head Self-Attention followed by Add & Normalize: This is the self-attention mechanism we discussed earlier, where every position in the input sequence attends to all positions in the same sequence. After the attention output is computed, the sub-layer's input is added back to it through a residual connection (the "add" operation), and layer normalization is then applied (see the short sketch after this list).

  2. Position-wise Feed-Forward Networks followed by Add & Normalize: This consists of two linear transformations with a ReLU activation in between. Just as with the self-attention sub-layer, a residual connection adds the input back to the feed-forward output, followed by layer normalization.

  • $X$ is the matrix after positional encoding.
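
Both sub-layers use the same "Add & Normalize" wrapper. As a minimal sketch (PyTorch is assumed here, and the function name is illustrative), the wrapper simply adds the sub-layer's input to the sub-layer's output and applies layer normalization:

```python
import torch
import torch.nn as nn

def add_and_normalize(x: torch.Tensor, sublayer_out: torch.Tensor,
                      norm: nn.LayerNorm) -> torch.Tensor:
    """Residual connection ("add") followed by layer normalization."""
    return norm(x + sublayer_out)
```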

Equations for the Encoder's Operations:

  1. Self-Attention:
     $$\text{Self-Attention}(X) = \text{MultiHead}(X, X, X)$$

  2. Add & Normalize after Self-Attention:
     $$\text{Output\_after\_attention} = \text{LayerNorm}(X + \text{Self-Attention}(X))$$

  3. Feed-Forward:

     Assuming the feed-forward network consists of two linear layers with weights $W_1$ and $W_2$, biases $b_1$ and $b_2$, and ReLU as the activation function, it can be represented as:

     $$\text{FFN}(X) = \text{ReLU}(X W_1 + b_1) W_2 + b_2$$

  4. Add & Normalize after Feed-Forward:
     $$\text{Output\_of\_encoder} = \text{LayerNorm}(\text{Output\_after\_attention} + \text{FFN}(\text{Output\_after\_attention}))$$

So, the output of the encoder, after processing the input matrix $X$ (with positional encodings), is $\text{Output\_of\_encoder}$. If there are multiple encoder layers in the Transformer, this output will serve as the input $X$ for the next encoder layer.
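
Putting the four equations together, the sketch below shows one encoder layer end to end. PyTorch is assumed (the article itself is framework-agnostic), the sizes `d_model = 512`, 8 heads, and `d_ff = 2048` are the defaults from the original Transformer paper, and the class and variable names are illustrative:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One Transformer encoder layer, following the equations above
    (post-norm: LayerNorm is applied after each residual addition)."""

    def __init__(self, d_model: int = 512, num_heads: int = 8, d_ff: int = 2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.linear1 = nn.Linear(d_model, d_ff)   # W1, b1
        self.linear2 = nn.Linear(d_ff, d_model)   # W2, b2
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-Attention(X) = MultiHead(X, X, X)
        attn_out, _ = self.self_attn(x, x, x)
        # Output_after_attention = LayerNorm(X + Self-Attention(X))
        y = self.norm1(x + attn_out)
        # FFN(Y) = ReLU(Y W1 + b1) W2 + b2
        ffn_out = self.linear2(torch.relu(self.linear1(y)))
        # Output_of_encoder = LayerNorm(Y + FFN(Y))
        return self.norm2(y + ffn_out)

# Usage: x is a batch of embedded, position-encoded sequences.
x = torch.randn(2, 10, 512)     # (batch, sequence length, d_model)
out = EncoderLayer()(x)         # same shape as x; would feed the next encoder layer
```

Note that `nn.MultiheadAttention` keeps the per-head projection matrices internally; the next section spells them out explicitly.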


Encoder Process with Parameters:


  1. Word Embeddings & Positional Encoding: $\text{Sentence} \rightarrow \text{Word Embeddings (Embedding matrix)} \rightarrow \text{Add Positional Encoding} \rightarrow X$

  2. Multi-Head Self-Attention:

    • For each head $i$: $X \xrightarrow{W_{Qi},\, W_{Ki},\, W_{Vi}} Q_i, K_i, V_i$
    • Self-attention for each head: $Q_i, K_i, V_i \rightarrow \text{Scaled Dot-Product Attention} \rightarrow Z_i$
    • Combine all heads: $\text{Concatenate } Z_i \text{ matrices} \xrightarrow{W_O} Z_{\text{combined}}$
  3. Add & Normalize after Self-Attention: $X + Z_{\text{combined}} \rightarrow \text{Layer Normalization (with learned parameters } \gamma \text{ and } \beta \text{)} \rightarrow Y$

  4. Position-wise Feed-Forward Network (FFN):

    • FFN parameters: $W_1, b_1$ for the first layer and $W_2, b_2$ for the second layer.

    $Y \xrightarrow{W_1,\, b_1} \text{Linear Transformation} \rightarrow \text{ReLU Activation} \xrightarrow{W_2,\, b_2} \text{Linear Transformation} \rightarrow F$

  5. Add & Normalize after FFN: $Y + F \rightarrow \text{Layer Normalization (with learned parameters } \gamma \text{ and } \beta \text{)} \rightarrow O$

Where:

  • $W_{Qi}$, $W_{Ki}$, and $W_{Vi}$ are the weight matrices for computing the Query, Key, and Value for the $i^{th}$ attention head.
  • $W_O$ is the weight matrix for combining the outputs of all attention heads.
  • $W_1, b_1$ are the weight matrix and bias for the first linear transformation in the FFN.
  • $W_2, b_2$ are the weight matrix and bias for the second linear transformation in the FFN.
  • $\gamma$ and $\beta$ are the learned scale and shift parameters for layer normalization.

The final output after one encoder layer is $O$. If there are more encoder layers, $O$ would serve as the input $X$ for the next encoder layer.
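
To make these parameters concrete, here is a sketch of step 2 with the per-head projections written out one by one, plus a look at where $\gamma$ and $\beta$ live in a layer-normalization module. PyTorch is again assumed, and the class name and sizes are illustrative rather than taken from the article:

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Multi-head self-attention with explicit per-head weights W_Qi, W_Ki, W_Vi
    and the output projection W_O."""

    def __init__(self, d_model: int = 512, num_heads: int = 8):
        super().__init__()
        self.d_k = d_model // num_heads
        # One (W_Qi, W_Ki, W_Vi) triple per head i, each mapping d_model -> d_k.
        self.W_Q = nn.ModuleList([nn.Linear(d_model, self.d_k, bias=False) for _ in range(num_heads)])
        self.W_K = nn.ModuleList([nn.Linear(d_model, self.d_k, bias=False) for _ in range(num_heads)])
        self.W_V = nn.ModuleList([nn.Linear(d_model, self.d_k, bias=False) for _ in range(num_heads)])
        # W_O maps the concatenated head outputs back to d_model.
        self.W_O = nn.Linear(num_heads * self.d_k, d_model, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        heads = []
        for W_Q, W_K, W_V in zip(self.W_Q, self.W_K, self.W_V):
            Q, K, V = W_Q(x), W_K(x), W_V(x)                        # X -> Q_i, K_i, V_i
            scores = Q @ K.transpose(-2, -1) / math.sqrt(self.d_k)  # scaled dot-product scores
            Z_i = torch.softmax(scores, dim=-1) @ V                 # Z_i for this head
            heads.append(Z_i)
        return self.W_O(torch.cat(heads, dim=-1))                   # concatenate Z_i, apply W_O

# gamma and beta from steps 3 and 5 are the affine parameters of LayerNorm:
norm = nn.LayerNorm(512)
print(norm.weight.shape, norm.bias.shape)  # gamma and beta, each of shape (512,)
```

In practice the per-head projections are usually fused into single $d_{\text{model}} \times d_{\text{model}}$ matrices and computed in one batched operation; the loop over heads here only mirrors the per-head notation above.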

